Core Demo Dashboard¶
Objective: Create a Data Science notebook, using the Olist e-commerce dataset, that includes forecasting visualisations from the trained model for the demo dashboard shown to prospective clients.
Created By: Stephen Cole
Creation Date: 2023-04-06
Last Updated¶
%load_ext pretty_jupyter
from datetime import datetime, timedelta
print(f"Last Updated: {datetime.now().strftime('%d/%m/%Y %H:%M:%S')}")
Last Updated: 12/04/2023 16:52:35
Introduction¶
Brazilian E-Commerce Public Dataset by Olist. Welcome. This is a Brazilian e-commerce public dataset of orders made at Olist Store. The dataset contains information on 100k orders placed between 2016 and 2018 across multiple marketplaces in Brazil. Its features allow viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and, finally, reviews written by customers. Olist also released a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.
A separate dataset was created to supplement the state codes, providing the full state name rather than just the code.
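As a sketch of how that lookup attaches to the customer data (hypothetical frame and names; the real supplementary file covers all 27 Brazilian states and is loaded with the rest of the data):

```python
import pandas as pd

# Hypothetical state lookup with only three rows for illustration
state_lookup = pd.DataFrame({
    "customer_state": ["SP", "RJ", "MG"],
    "state_name": ["Sao Paulo", "Rio de Janeiro", "Minas Gerais"],
})

customers = pd.DataFrame({"customer_id": ["c1", "c2"], "customer_state": ["SP", "RJ"]})

# Left join keeps every customer and attaches the full state name
enriched = customers.merge(state_lookup, on="customer_state", how="left")
```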
1.0 Preliminary Setup¶
1.1 Import Packages¶
import os
import time
import logging
import pickle
import yaml
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display, Markdown
from functions.DS_PreProcessing_Functions import reduce_mem_usage, find_column_type
pd.set_option('display.max_columns', None, 'display.max_rows', 500)
1.2 Load Config File¶
path_to_config = ("C:/Users/Stephen.Cole/Dropbox/My PC (XT-LPT-012)/Documents/Core Consultants"
                  "/Internal Dashboard/Python/python_config/python_config.yaml")
# Load config
with open(path_to_config, 'r') as stream:
demo_dashboard_config = yaml.safe_load(stream)
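The keys accessed later in the notebook (`Paths.path_to_log`, `Paths.path_to_data`, `Files.input_csv`) suggest a config layout along these lines (a hypothetical sketch, not the actual file contents):

```yaml
# Hypothetical layout inferred from the keys used in this notebook
Paths:
  path_to_log: /path/to/logs
  path_to_data: /path/to/data
Files:
  input_csv:
    - customers.csv
    - geolocation.csv
    # one entry per input file, in the same order as var_names below
```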
1.3 Setup Log File¶
path_to_log = demo_dashboard_config['Paths']['path_to_log']
# Setting up log file
os.makedirs(path_to_log, exist_ok=True)
run_date = datetime.today().strftime('%Y-%m-%d')
log_file_name = run_date + "_demo_dashboard.log"
path_to_log_file = os.path.join(path_to_log, log_file_name)
logging.basicConfig(filename=path_to_log_file,
filemode='w',
format='%(asctime)s %(name)s %(levelname)s %(message)s',
level=logging.INFO,
datefmt='%Y-%m-%d')
LOGGER = logging.getLogger(__name__)
1.4 Load Data¶
def create_dfs(dataframe_names, file_paths, log=LOGGER):
    dfs = {}
    for name, file in zip(dataframe_names, file_paths):
        try:
            dfs[name] = pd.read_csv(file)
            log.info(f"Data file {file} read SUCCESSFULLY")
            log.info(f"Data file {file} has {dfs[name].shape} [rows, columns]")
        except Exception as e:
            log.error(f"Data file {file} read UNSUCCESSFULLY with exception {e}")
    return dfs
var_names = ["Customers", "Geolocation", "Items", "Payments", "Reviews", "Orders",
"Products", "Sellers", "Products_translation"]
input_file_paths = [os.path.join(demo_dashboard_config['Paths']['path_to_data'], file) for file in demo_dashboard_config['Files']['input_csv']]
dfs = create_dfs(var_names, input_file_paths)
2.0 Data Model¶
There are 100k orders from 2016 to 2018, split across 9 datasets to investigate. The data model is described in the image below:
2.1 Principal Dataset¶
Order Reviews Dataset

- Review ID (object)
- Order ID (object)
- Review Score (int64)
- Review Comment Title (object)
- Review Comment Message (object)
- Review Creation Date (object)
- Review Answer Timestamp (object)

NOTE: `review_comment_title` and `review_comment_message` will be converted to English and passed through an NLP model to create new features that will hopefully help predict the `review_score`.
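The translation and NLP steps come later; as an interim illustration, simple presence/length features can already be derived from the raw comments (a sketch on a toy frame, not the notebook's data):

```python
import pandas as pd

reviews = pd.DataFrame({
    "review_comment_message": ["Péssimo", None, "Recebi exatamente o que esperava."],
    "review_score": [1, 5, 4],
})

# Whether a comment exists at all, and how long it is, are cheap proxies
# before any translation or sentiment model is applied
reviews["has_comment"] = reviews["review_comment_message"].notna().astype(int)
reviews["comment_length"] = (
    reviews["review_comment_message"].str.len().fillna(0).astype(int)
)
```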
display(dfs["Reviews"].loc[~dfs["Reviews"]["review_comment_title"].isna(),:].set_index("review_id").head())
| review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|
| 8670d52e15e00043ae7de4c01cc2fe06 | b9bf720beb4ab3728760088589c62129 | 4 | recomendo | aparelho eficiente. no site a marca do aparelh... | 2018-05-22 00:00:00 | 2018-05-23 16:45:47 |
| 3948b09f7c818e2d86c9a546758b2335 | e51478e7e277a83743b6f9991dbfa3fb | 5 | Super recomendo | Vendedor confiável, produto ok e entrega antes... | 2018-05-23 00:00:00 | 2018-05-24 03:00:01 |
| 373cbeecea8286a2b66c97b1b157ec46 | 583174fbe37d3d5f0d6661be3aad1786 | 1 | Não chegou meu produto | Péssimo | 2018-08-15 00:00:00 | 2018-08-15 04:10:37 |
| d21bbc789670eab777d27372ab9094cc | 4fc44d78867142c627497b60a7e0228a | 5 | Ótimo | Loja nota 10 | 2018-07-10 00:00:00 | 2018-07-11 14:10:25 |
| c92cdd7dd544a01aa35137f901669cdf | 37e7875cdce5a9e5b3a692971f370151 | 4 | Muito bom. | Recebi exatamente o que esperava. As demais en... | 2018-06-07 00:00:00 | 2018-06-09 18:44:02 |
2.2 Other Datasets¶
2.2.1 Customers Dataset¶
- Customer ID (object)
- Customer Unique ID (object)
- Customer ZIP Code Prefix (int64)
- Customer City (object)
- Customer State (object)
display(dfs["Customers"].set_index("customer_id").head())
| customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state |
|---|---|---|---|---|
| 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP |
| 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 9790 | sao bernardo do campo | SP |
| 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 1151 | sao paulo | SP |
| b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 8775 | mogi das cruzes | SP |
| 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP |
2.2.2 Geolocation Lookup¶
- Geolocation ZIP Code Prefix (int64)
- Geolocation Latitude (float64)
- Geolocation Longitude (float64)
- Geolocation City (object)
- Geolocation State (object)
NOTE: This lookup table will not be used for modelling, so it can be ignored throughout.
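Were it ever needed, the natural way to use this table would be to collapse it to one coordinate per zip prefix before joining, since prefixes repeat (a sketch on a toy frame):

```python
import pandas as pd

geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1046, 1046, 1037],
    "geolocation_lat": [-23.546081, -23.546129, -23.545621],
    "geolocation_lng": [-46.644820, -46.642951, -46.639292],
})

# Prefixes repeat, so collapse to one averaged coordinate per prefix
# before any join against customers or sellers
geo_lookup = geo.groupby("geolocation_zip_code_prefix")[
    ["geolocation_lat", "geolocation_lng"]
].mean()
```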
display(dfs["Geolocation"].set_index("geolocation_zip_code_prefix").head())
| geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state |
|---|---|---|---|---|
| 1037 | -23.545621 | -46.639292 | sao paulo | SP |
| 1046 | -23.546081 | -46.644820 | sao paulo | SP |
| 1046 | -23.546129 | -46.642951 | sao paulo | SP |
| 1041 | -23.544392 | -46.639499 | sao paulo | SP |
| 1035 | -23.541578 | -46.641607 | sao paulo | SP |
2.2.3 Orders Dataset¶
- Order ID (object)
- Customer ID (object)
- Order Status (object)
- Order Purchase Timestamp (object)
- Order Approved At (object)
- Order Delivered Carrier Date (object)
- Order Delivered Customer Date (object)
- Order Estimated Delivery Date (object)
display(dfs["Orders"].set_index("order_id").head())
| order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date |
|---|---|---|---|---|---|---|---|
| e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
| 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
| 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |
| 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 |
| ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 |
2.2.4 Sellers Lookup¶
- Seller ID (object)
- Seller ZIP Code Prefix (int64)
- Seller City (object)
- Seller State (object)
display(dfs["Sellers"].set_index("seller_id").head())
| seller_id | seller_zip_code_prefix | seller_city | seller_state |
|---|---|---|---|
| 3442f8959a84dea7ee197c632cb2df15 | 13023 | campinas | SP |
| d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | mogi guacu | SP |
| ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | rio de janeiro | RJ |
| c0f3eea2e14555b6faeea3dd58c1b1c3 | 4195 | sao paulo | SP |
| 51a04a8a6bdcb23deccc82b0b80742cf | 12914 | braganca paulista | SP |
2.2.5 Product Category Name Translation¶
- Product Category Name (object)
- Product Category Name English (object)
display(dfs["Products_translation"].set_index("product_category_name").head())
| product_category_name | product_category_name_english |
|---|---|
| beleza_saude | health_beauty |
| informatica_acessorios | computers_accessories |
| automotivo | auto |
| cama_mesa_banho | bed_bath_table |
| moveis_decoracao | furniture_decor |
2.2.6 Product Lookup¶
- Product ID (object)
- Product Category Name (object)
- Product Name Length (float64)
- Product Description Length (float64)
- Product Photos Qty (float64)
- Product Weight g (float64)
- Product Length cm (float64)
- Product Height cm (float64)
- Product Width cm (float64)
display(dfs["Products"].set_index("product_id").head())
| product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm |
|---|---|---|---|---|---|---|---|---|
| 1e9e8ef04dbcff4541ed26657ea517e5 | perfumaria | 40.0 | 287.0 | 1.0 | 225.0 | 16.0 | 10.0 | 14.0 |
| 3aa071139cb16b67ca9e5dea641aaa2f | artes | 44.0 | 276.0 | 1.0 | 1000.0 | 30.0 | 18.0 | 20.0 |
| 96bd76ec8810374ed1b65e291975717f | esporte_lazer | 46.0 | 250.0 | 1.0 | 154.0 | 18.0 | 9.0 | 15.0 |
| cef67bcfe19066a932b7673e239eb23d | bebes | 27.0 | 261.0 | 1.0 | 371.0 | 26.0 | 4.0 | 26.0 |
| 9dc1a7de274444849c219cff195d0b71 | utilidades_domesticas | 37.0 | 402.0 | 4.0 | 625.0 | 20.0 | 17.0 | 13.0 |
2.2.7 Order Items Dataset¶
- Order ID (object)
- Order Item ID (int64)
- Product ID (object)
- Seller ID (object)
- Shipping Limit Date (object)
- Price (float64)
- Freight Value (float64)
display(dfs["Items"].set_index("order_id").head())
| order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value |
|---|---|---|---|---|---|---|
| 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 48436dade18ac8b2bce089ec2a041202 | 2017-09-19 09:45:35 | 58.90 | 13.29 |
| 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | dd7ddc04e1b6c2c614352b383efe2d36 | 2017-05-03 11:05:13 | 239.90 | 19.93 |
| 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 5b51032eddd242adc84c38acab88f23d | 2018-01-18 14:48:30 | 199.00 | 17.87 |
| 00024acbcdf0a6daa1e931b038114c75 | 1 | 7634da152a4610f1595efa32f14722fc | 9d7a1d34a5052409006425275ba1c2b4 | 2018-08-15 10:10:18 | 12.99 | 12.79 |
| 00042b26cf59d7ce69dfabb4e55b4fd9 | 1 | ac6c3623068f30de03045865e4e10089 | df560393f3a51e74553ab94004ba5c87 | 2017-02-13 13:57:51 | 199.90 | 18.14 |
2.2.8 Order Payments Dataset¶
- Order ID (object)
- Payment Sequential (int64)
- Payment Type (object)
- Payment Installments (int64)
- Payment Value (float64)
display(dfs["Payments"].set_index("order_id").head())
| order_id | payment_sequential | payment_type | payment_installments | payment_value |
|---|---|---|---|---|
| b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
| a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
| 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |
| ba78997921bbcdc1373bb41e913ab953 | 1 | credit_card | 8 | 107.78 |
| 42fdf880ba16b47b59251dd489d4441a | 1 | credit_card | 2 | 128.45 |
3.0 Creating the "Single Source of Truth" Table¶
The relevant tables will be joined together to create a single table containing all the useful features. This will make data cleaning, standardisation, and feature engineering easier, since the data only needs to go through one cleaning process rather than eight separate ones.
Not all the columns/datasets will be used, so the surplus will be dropped accordingly.
Some useful features can also be derived from existing ones. For example, the waiting duration could affect the review_score.
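For instance, a waiting-duration feature can be derived from the order timestamps once they are parsed as datetimes (a sketch on one toy row; the column names match the Orders dataset):

```python
import pandas as pd

# One toy row; column names match the Orders dataset
orders = pd.DataFrame({
    "order_purchase_timestamp": ["2017-10-02 10:56:33"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13"],
    "order_estimated_delivery_date": ["2017-10-18 00:00:00"],
})
for col in orders.columns:
    orders[col] = pd.to_datetime(orders[col])

# Whole days the customer waited, and whether delivery beat the estimate
orders["wait_days"] = (
    orders["order_delivered_customer_date"] - orders["order_purchase_timestamp"]
).dt.days
orders["delivered_early"] = (
    orders["order_delivered_customer_date"] <= orders["order_estimated_delivery_date"]
).astype(int)
```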
df_names = list(dfs.keys())
df_names.remove("Reviews")
df_names.remove("Orders")
df_names.remove("Geolocation")
modelling_df = pd.merge(dfs["Reviews"], dfs["Orders"], on="order_id", how="left", suffixes=('', '_y'))
for df in df_names:
    print(f"{df} DataFrame has shape {dfs[df].shape}")
    if "order_id" in dfs[df].columns:
        modelling_df = modelling_df.merge(dfs[df], on="order_id", how="left", suffixes=('', '_y'))
    elif "customer_id" in dfs[df].columns:
        modelling_df = modelling_df.merge(dfs[df], on="customer_id", how="left", suffixes=('', '_y'))
    elif "product_id" in dfs[df].columns:
        modelling_df = modelling_df.merge(dfs[df], on="product_id", how="left", suffixes=('', '_y'))
    elif "seller_id" in dfs[df].columns:
        modelling_df = modelling_df.merge(dfs[df], on="seller_id", how="left", suffixes=('', '_y'))
    else:
        modelling_df = modelling_df.merge(dfs[df], on="product_category_name", how="left", suffixes=('', '_y'))
    print(f"{df} has been merged to truth table \n")
cols_reordered = ['review_id', 'order_id','customer_id', 'product_id', 'seller_id', 'product_category_name',
'review_comment_title','review_comment_message', 'review_creation_date',
'review_answer_timestamp', 'order_status',
'order_purchase_timestamp', 'order_approved_at',
'order_delivered_carrier_date', 'order_delivered_customer_date',
'order_estimated_delivery_date', 'customer_unique_id',
'customer_zip_code_prefix', 'customer_city', 'customer_state',
'order_item_id', 'shipping_limit_date',
'price', 'freight_value', 'payment_sequential', 'payment_type',
'payment_installments', 'payment_value',
'product_name_lenght', 'product_description_lenght',
'product_photos_qty', 'product_weight_g', 'product_length_cm',
'product_height_cm', 'product_width_cm', 'seller_zip_code_prefix',
'seller_city', 'seller_state', 'product_category_name_english', 'review_score']
modelling_df.drop_duplicates(subset=["order_id","customer_id","product_id","seller_id","product_category_name"],
inplace=True, keep="last")
display(modelling_df[cols_reordered].head())
Customers has shape (99441, 5)
Customers has been merged to truth table
Items has shape (112650, 7)
Items has been merged to truth table
Payments has shape (103886, 5)
Payments has been merged to truth table
Products has shape (32951, 9)
Products has been merged to truth table
Sellers has shape (3095, 4)
Sellers has been merged to truth table
Products_translation has shape (71, 2)
Products_translation has been merged to truth table
| | review_id | order_id | customer_id | product_id | seller_id | product_category_name | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | order_item_id | shipping_limit_date | price | freight_value | payment_sequential | payment_type | payment_installments | payment_value | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm | seller_zip_code_prefix | seller_city | seller_state | product_category_name_english | review_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 41dcb106f807e993532d446263290104 | fd25ab760bfbba13c198fa3b4f1a0cd3 | 6d803cb79cc31c41c4c789a75933b3c7 | esporte_lazer | NaN | NaN | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 | delivered | 2018-01-11 15:30:49 | 2018-01-11 15:47:59 | 2018-01-12 21:57:22 | 2018-01-17 18:42:41 | 2018-02-02 00:00:00 | 68a5590b9926689be4e10f4ae2db21a8 | 6030 | osasco | SP | 2.0 | 2018-01-18 15:47:59 | 185.00 | 13.63 | 1.0 | credit_card | 8.0 | 397.26 | 42.0 | 858.0 | 1.0 | 1300.0 | 30.0 | 30.0 | 35.0 | 14600.0 | sao joaquim da barra | SP | sports_leisure | 4 |
| 2 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 8a2e7ef9053dea531e4dc76bd6d853e6 | be0dbdc3d67d55727a65d4cd696ca73c | 8e6d7754bc7e0f22c96d255ebda59eba | informatica_acessorios | NaN | NaN | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 | delivered | 2018-02-28 12:25:19 | 2018-02-28 12:48:39 | 2018-03-02 19:08:15 | 2018-03-09 23:17:20 | 2018-03-14 00:00:00 | 64190b91b656ab8f37eb89b93dc84584 | 13380 | nova odessa | SP | 1.0 | 2018-03-06 12:48:39 | 79.79 | 8.30 | 1.0 | credit_card | 1.0 | 88.09 | 47.0 | 493.0 | 1.0 | 245.0 | 19.0 | 14.0 | 14.0 | 12233.0 | sao jose dos campos | SP | computers_accessories | 5 |
| 3 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | e226dfed6544df5b7b87a48208690feb | d1c427060a0f73f6b889a5c7c61f2ac4 | a1043bafd471dff536d0c462352beb48 | informatica_acessorios | NaN | NaN | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 | delivered | 2018-02-03 09:56:22 | 2018-02-03 10:33:41 | 2018-02-06 16:18:28 | 2018-02-16 17:28:48 | 2018-03-09 00:00:00 | 1d47144362c14e94ccdd213e8ec277d5 | 44571 | santo antonio de jesus | BA | 1.0 | 2018-02-08 10:31:15 | 149.00 | 45.12 | 1.0 | credit_card | 1.0 | 194.12 | 59.0 | 1893.0 | 1.0 | 6550.0 | 20.0 | 20.0 | 20.0 | 37175.0 | ilicinea | MG | computers_accessories | 5 |
| 4 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | de6dff97e5f1ba84a3cd9a3bc97df5f6 | 52c80cedd4e90108bf4fa6a206ef6b03 | a1043bafd471dff536d0c462352beb48 | ferramentas_jardim | NaN | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 | delivered | 2017-04-09 17:41:13 | 2017-04-09 17:55:19 | 2017-04-10 14:24:47 | 2017-04-20 09:08:35 | 2017-05-10 00:00:00 | c8cf6cb6b838dc7a33ed199b825e8616 | 88735 | gravatal | SC | 1.0 | 2017-04-13 17:55:19 | 179.99 | 42.85 | 1.0 | credit_card | 1.0 | 222.84 | 33.0 | 2188.0 | 2.0 | 7650.0 | 20.0 | 20.0 | 20.0 | 37175.0 | ilicinea | MG | garden_tools | 5 |
| 5 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5986b333ca0d44534a156a52a8e33a83 | 3880d25d502b15b1de6fddc42ad1d67a | 989becdce12ebc39863c2bceab6f3ca1 | esporte_lazer | NaN | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 | delivered | 2018-02-10 10:59:03 | 2018-02-10 15:48:21 | 2018-02-15 19:36:14 | 2018-02-28 16:33:35 | 2018-03-09 00:00:00 | d16000272660a1fef81482ad75ba572a | 89520 | curitibanos | SC | 1.0 | 2018-02-15 15:48:21 | 1199.00 | 134.25 | 1.0 | credit_card | 10.0 | 1333.25 | 48.0 | 562.0 | 5.0 | 9850.0 | 105.0 | 35.0 | 50.0 | 81730.0 | curitiba | PR | sports_leisure | 5 |
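Left joins like those above can silently fan rows out when the right-hand key repeats, which is why the duplicate drop is needed; pandas' merge `indicator` (and the `validate` argument) are handy sanity checks (toy frames, not the notebook's data):

```python
import pandas as pd

reviews = pd.DataFrame({"order_id": ["a", "b"], "review_score": [5, 1]})
items = pd.DataFrame({"order_id": ["a", "a", "c"], "price": [10.0, 12.0, 7.0]})

# indicator=True adds a _merge column flagging unmatched rows; note how the
# repeated key "a" fans one review out into two result rows
merged = reviews.merge(items, on="order_id", how="left", indicator=True)
```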
modelling_df.info(verbose=True)
# Memory optimisation of DataFrame - see DS_PreProcessing_Functions.py file for detail
modelling_df = reduce_mem_usage(modelling_df)
Memory usage of dataframe is 32.03 MB
Memory usage after optimization is: 43.56 MB
Decreased by -36.0%
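The actual `reduce_mem_usage` lives in DS_PreProcessing_Functions.py; the core idea is numeric downcasting, roughly along these lines (a minimal sketch, not the real implementation, which also converts object columns to category — likely why the reported size can grow for high-cardinality text such as the ID columns above):

```python
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that holds their values."""
    out = df.copy()
    for col in out.select_dtypes(include=["int64"]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=["float64"]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

df = pd.DataFrame({"score": [1, 2, 5], "price": [58.9, 239.9, 199.0]})
small = downcast_numeric(df)
```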
modelling_df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 102384 entries, 1 to 118145
Data columns (total 40 columns):
 #   Column                         Non-Null Count   Dtype
---  ------                         --------------   -----
 0   review_id                      102384 non-null  category
 1   order_id                       102384 non-null  category
 2   review_score                   102384 non-null  int8
 3   review_comment_title           12124 non-null   category
 4   review_comment_message         42870 non-null   category
 5   review_creation_date           102384 non-null  category
 6   review_answer_timestamp        102384 non-null  category
 7   customer_id                    102384 non-null  category
 8   order_status                   102384 non-null  category
 9   order_purchase_timestamp       102384 non-null  category
 10  order_approved_at              102229 non-null  category
 11  order_delivered_carrier_date   100619 non-null  category
 12  order_delivered_customer_date  99502 non-null   category
 13  order_estimated_delivery_date  102384 non-null  category
 14  customer_unique_id             102384 non-null  category
 15  customer_zip_code_prefix       102384 non-null  int32
 16  customer_city                  102384 non-null  category
 17  customer_state                 102384 non-null  category
 18  order_item_id                  101628 non-null  float32
 19  product_id                     101628 non-null  category
 20  seller_id                      101628 non-null  category
 21  shipping_limit_date            101628 non-null  category
 22  price                          101628 non-null  float32
 23  freight_value                  101628 non-null  float32
 24  payment_sequential             102383 non-null  float32
 25  payment_type                   102383 non-null  category
 26  payment_installments           102383 non-null  float32
 27  payment_value                  102383 non-null  float32
 28  product_category_name          100180 non-null  category
 29  product_name_lenght            100180 non-null  float32
 30  product_description_lenght     100180 non-null  float32
 31  product_photos_qty             100180 non-null  float32
 32  product_weight_g               101612 non-null  float32
 33  product_length_cm              101612 non-null  float32
 34  product_height_cm              101612 non-null  float32
 35  product_width_cm               101612 non-null  float32
 36  seller_zip_code_prefix         101628 non-null  float32
 37  seller_city                    101628 non-null  category
 38  seller_state                   101628 non-null  category
 39  product_category_name_english  100158 non-null  category
dtypes: category(24), float32(14), int32(1), int8(1)
memory usage: 43.6 MB
# Check each source DataFrame for columns containing NULLs
null_dict = {}
for df_name, df in dfs.items():
    null_dict[df_name] = [col for col in df.columns if df[col].isna().any()]
null_dict
NOTE: `review_comment_title` and `review_comment_message` won't be used in the model, so they can be dropped straight away.
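A minimal sketch of that drop on a toy frame (in the notebook it would apply to modelling_df):

```python
import pandas as pd

df = pd.DataFrame({
    "review_comment_title": ["recomendo"],
    "review_comment_message": ["aparelho eficiente."],
    "review_score": [4],
})

# Drop the free-text columns that the model will not use
df = df.drop(columns=["review_comment_title", "review_comment_message"])
```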